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Abstract 

In this study, we analyze the Genetic Analysis Worl<shop 18 data to identify the genes and underlying single- 
nucleotide polymorphisms on 1 1 chromosomes that exhibit significant association with systolic blood pressure. We 
propose a novel family-based method for rare-variant association detection based on the hierarchical Bayesian 
framework. The method controls spurious associations caused by population stratification, and improves the 
statistical power to detect not only individual rare variants, but also genes with either continuous or binary 
outcomes. Our method utilizes nuclear family information, and takes into account the effects of all single- 
nucleotide polymorphisms in a gene, using a hierarchical model. When we apply this method to the genome-wide 
Genetic Analysis Workshop 18 data, several genes and single-nucleotide polymorphisms are identified as potentially 
related to systolic blood pressure. 



Background 

Current studies suggest that a large number of common 
variants identified in genome-wide association studies 
(GWAS) as being associated with various complex dis- 
eases can account for only a small portion of phenotype 
variation [1]. With the advent of next-generation 
sequencing, attention has focused on rare variants 
(RVs), such as single-nucleotide polymorphisms (SNPs) 
with a minor allele frequency (MAF) of less than 1%. 
Traditional single-marker methods lose statistical power 
for detecting RV association because of their rare occur- 
rence. In the last few years, however, a variety of meth- 
ods have been developed, including the combined 
multivariate and collapsing (CMC) method [2] and the 
weighted sum (WS) method [3]. More sophisticated 
methods that are robust to different variant effects 
include the kernel-based adaptive cluster (KBAC) 
method [4], the C-alpha test [5], and the sequence ker- 
nel association test (SKAT) [6]. 

These methods, however, all assume that individuals 
are independently sampled and are, therefore, vulnerable 



to the influence of population stratification. Exploring 
marker transmission within a family avoids the issues of 
population stratification. More important, once anRV 
enters a family, it can segregate to other family members 
so that copies of the minor allele are enriched in the 
data. This could potentially increase the statistical power 
of family-based approaches. 

Here we propose a novel family-based Bayesian collap- 
sing model (FBCM) capable of identifying associations 
of RVs and genes with quantitative phenotypes. The 
method builds on the hierarchical quantitative transmis- 
sion disequilibrium test (HQTDT) [7]. Compared to 
classical statistical methods, the Bayesian framework 
incorporates prior information, thereby providing an 
alternative approach to situations in which factors 
affecting the power of the test, such as the MAF of the 
SNP, play an important role [8]. We combine HQTDT 
with the idea of collapsing under a Bayesian framework. 
Then we expand the model in a data-driven manner by 
utilizing random effects to model the signals of indivi- 
dual rare variants within a gene. 
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Table 1 Most significant genes associated with SBP 



Chr EnsembI ID 
(ENSGOOOOO-) 


HGNC symbol 


BF' 


Effect 
size 


5 213568 


HNRNPA1P13 


8.52 


0.43 


19 127529 


OR7C2 


4.82 


-0.24 


15 259500 


RP11-138E16.2 


4.43 


-0.32 


Gene: HNRNPA1P13 








Chr Position 


BF'^ (against 
null) 


Index SNP 


MAP 


5 135754696 


36.84 


rs 139658064 


0.004 


5 135765214 


0.00368 


NA 


NA 



' BF in formula (2). 
^ BF in formula (3). 



Methods 

Family-based Bayesian collapsing model 

Several statistical models based on the Bayesian frame- 
work, such as model selection [9] and multiple regres- 
sion [10], have been proposed for RV association 
detection. Because of the large scale of model space and 
matrix calculation, these approaches suffer from imprac- 
tical computational time if a full joint posterior distribu- 
tion is required. Although most Bayesian methods 
endeavor to employ various optimization or approxima- 
tion algorithms to obtain a point estimate, the loss of 
uncertainty information on the estimate means the sig- 
nificance of the estimate cannot be evaluated. In this 
paper, we propose a Bayesian model that aims to effi- 
ciently generate a full posterior distribution without the 
loss of model space, and is viable for family-based gen- 
ome-wide association analysis. The central idea comes 
from collapsing RVs and modeling their effects using 
variant-specific random effects. In some cases, it is prob- 
able that an RV is enriched in certain pedigrees while 
being very rare in others. Thus, among a group of RVs, 
some can be both neutral and associated with the phe- 
notype through population stratification. To solve this 
problem, the RVs are collapsed in 2 orthogonal compo- 
nents to adjust for the possible population stratification. 

Consider a candidate gene that contains K diallelic 
loci (in this paper, a locus always refers to the location 
of a SNP) with MAP less than 1%. Given a set of i= I,..., 
M nuclear families, each of which contains k, siblings so 

that the total number of offspring is ^ n, = N, we 

define the coded genotypic score G,yAr for the child in 

the family as the number of minor alleles at the k^^ 

locus. It is assumed that both parents of each child are 

available, and, correspondingly, the genotypic scores of 

the parents at the A:* locus in the family are denoted 

by GMik and GFik, respectively, for the mother and 

father. Conditional on the parental genotypes, the 

expected score for the offspring in the f'^ family at the 

I th I 1 11., . „ GMih + GFih 
k locus under mendelian law is Gfi* = . 



Purthermore, the deviation of the genotypic score for 
the child in the family at the A:* locus, which is 
denoted by Diji^, is Giji^- GEn^. Por technical reasons, we 
add a pseudolocus k = Q and define G£,o = £>,yo = 0. 
When at least lof the parents carries the copies of 
minor alleles at a locus, it is then possible to observe 
deviation in offspring at this locus. However, given a 
moderate set of variants, it is very unlikely for an indivi- 
dual to harbor minor alleles at more than 2 causal var- 
iants. For instance, when MAP is 0.005 and there are 
50 independent causal RVs, the probability of an indivi- 
dual having minor alleles at more than 2 loci is 

1 - 0.99'° - 50 • 0.99" • 0.01 - -5^^ • 0.99** • 0.01^ «i 1.38%. ThuS, 

2 

by taking advantage of the rare occurrence of copies of 
minor alleles, for each individual we consider at most 2 
loci that have nonzero deviation in a candidate gene. 
These are indexed by r;y and S;y, which are defined 
below. 




fe, if individual j in family i has deviation at least at 1 locus 

0, and the loots with the smallest MAP is indexed by k, otherwise. 

k, if individual j in family i has deviation at more than 1 locus 

0, and the locus with the second smallest MAP is indexed by k, otherwise. 



This method dramatically shortens computational time 
by avoiding large-scale matrix computation in Gibbs sam- 
pling. If an individual has nonzero deviation at fewer than 
2 loci, both or s,y are 0. Those with the smallest MAPs are 
selected if an individual has more than 2 loci with nonzero 
deviation. Thus, more emphasis is placed on those with 
smaller MAPs because deleterious functional variants tend 
to have low frequencies [12]. Given that RVs often do not 
exhibit strong linkage disequilibrium (LD) with either rare 
or common SNPs [11], for a moderate number of RVs 
such approximation loses much less information than do 
naive collapsing methods. Moreover, including 2 loci 
enables the model to detect the additive effect combina- 
tion of 2 RVs. 

Let yij denote the quantitative phenotype for the 7* 
child in the i"^ family. The relationship between the 
phenotype and the set of RVs in the candidate gene can 
be expressed by a hierarchical model 

Kj = + /3l-(«r,. Djjrii + as,,. + fe.Cyrj.GEir, + yisj.GEjj,,) +<Pi+ €jj (1) 

0i~N(O,Or2) 
TipSij e 0, 1, 

where ^ is the global intercept and fi,y is the random 
error. The genotypic score is decomposed into within- 
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family and between-famUy components, and the construc- 
tion of formula (1) guarantees the orthogonality of those 2 
components. Inference based on jii provides a stratifica- 
tion-resistant within-family test, while ji2 estimates the 
genetic effect resulting fi-om stratification. As a result of the 
limitation of the sample size for the inference and the fact 
that the variance components are not our major interest in 
this study, the family-level variable (pi is modeled as a ran- 
dom effect. This enables us to capture the between-family 
variance that includes the influence of the family-specific 
environmental factor. 

The vectors of variant-specific random effects 
a = (ao, ai, • • • , ak) and y = {yo, Ki, • • • , Kk) modulate 
the within-family and between-family global effects Pi 
and 132, respectively, r;^ and S;y are individual-specific 
indices of which elements in a and y contribute to the 
child in the family. It is possible that some of the 
RVs are neutral, but may be associated with the pheno- 
type through population stratification. Ignoring this pos- 
sibility will not only inflate the type I error rate, but also 
will introduce noise after collapsing. By modeling these 
2 situations using « and y separately, formula (1) 
(below) manages to detect the association, accounting 
for neutral RVs as well as population stratification. 

Although it is less common to observe LD between 
RVs compared to common variants, only independent 
representative SNPs are selected and included in the 
analysis. Because 2 loci at most are taken into account 
for each individual, such selection improves the accuracy 
and efficiency of the model. 

Prior distributions for random effects 

The multiplicative relation in the 2 pairs (i.e.. Pi and a, 
P2 and y) may result in a nonidentifiable model. To 
ensure identifiability, a and y -except for ao and yo, 
which are random effects of the pseudolocus and 
sampled from Bem{0)-zre selected to be independently 
sampled from Bernoulli distributions with hyper-para- 
meters /?Ar and qi;- Pi and P2 are given a noninformative 
normal prior distribution with some variance ct^, 
that is. 



Yk ■ 



■ Bern{pk), 
Bem{cfk), 



(Ik ■ 



■Beta{l, 1) 
Be£fl(l, 1) 



(2) 



The k variant is treated as associated when = 1; 
otherwise it is neutral. The hyperparameter p/^ is the 
predictor for a/^ and can be regarded as the probability 
of the A:* variant being associated. With such a prior 
distribution for the model actually selects the opti- 
mal group of associated RVs in a data-driven way and 
then collapses them together. 



Bayesian inference 

To investigate the gene-level association, we wish to test 
the hypothesis Pi = 0. However, in a Bayesian framework, 
this hypothesis cannot be evaluated directly because the 
posterior distribution of Pi is continuous. Instead, we can 
conduct a composite hypothesis test: 



Ho:Wi\ € or X!Li«* = 0 

au > 0, 

k=i 



(3) 



Where e is a small positive number. Although in princi- 
ple the choice of e is arbitrary, a too small e might inflate 
the estimate error resulting from the numerical approxi- 
mation. So we set e as 0.2 * a^, where is the estimated 
standard deviation for random error. The Bayes factor 
(BF) is a good way to summarize the evidence provided by 
the data in favor of one statistical model over another 
while also taking into account the complexity of a model. 
Note that the BF can be expressed using the ratio of the 
odds of the posterior distribution, which can be obtained 
approximately by the Monte Carlo Markov chain 
(MCMC) method, to the prior odds. For the prior distri- 
bution described above, the prior odds are calculated as 



erf 



(4) 



■0.5'' 



where erf (•) is the error function, defined as: 
erf{x) = —= \ e~'^dt. Thus, the hybrid BF can be 
obtained by 

„p,„ P( 1 > ^ n "k > 0\Data] , P{\fli\ x n at > 0) 

iSt (til • tiQ) - -7 / 



P(|;6i| >e|Dflto) , P(|/ii| > 6) 

rSr (rii '. rici) = ~ / 

pm\<€\Datay pm\<^) 



(5) 



where P{\fii\ > e n ^^ ^afe > 0\Data) and 

P{\Pi \ < e U ttfe = 0 1 Data) are estimated from the 

posterior distribution approximated by the outputs of 
the MCMC method. If the BF exceeds a certain thresh- 
old, which is selected through simulations, we conclude 
that Pi is significant. Once there is evidence of a global 
association, we can further assess the underlying RVs by 
investigating the marginal posterior distribution for a;^ k 
e 1,..., K. Note that if we treat as a model indicator, 
one way to quantify and summarize the posterior prob- 
abilities is to calculate the marginal BF, which is the 
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Power curves 



oo 
o 



E 



Effect size 
• 1.5 



T 1 1 1 r 

0.2 0,4 0.6 0.8 1.0 

Proportion of associated RVs 

Figure 1 Empirical power comparison between different vaules of jSj. The red dashed line is the power curve for fii = 1. The black solid 
ine is the power curve for Pi = 1.5. 



ratio of the posterior odds to the prior odds of the same 
variable, defined as: 



BF(Mi(afc ^0):Mo(afc = 0)): 



P{ak y^O\Y)/P(ak 7^0) 
P{ak = 0\y)/Piak = 0) 



The model is implemented using WinBUGS with 
50,000 iterations, and the convergence is checked by 
investigating the autocorrelations for all parameters. We 
also simulate several chains with different initial values 
simultaneously, and evaluate convergence with the Gel- 
man-Rubin convergence diagnostic tool [13]. 

Results 

Unfortunately, for the Genetic Analysis Workshop 
(GAW)18 simulated data, only 275 trios can be incorpo- 
rated in our analyses, owing to the large number of par- 
ents with missing genotype. Most causal SNPs with 
MAP less than 0.01 do not present minor alleles in 
these 275 trios, so they are not suitable data for testing 
our method. Consequently, we investigated the perfor- 
mance of FBCM by generating a variety of simulation 
scenarios involving different effect sizes and proportions 
of associated RVs. In particular, we consider the settings 
in which 20% to 100% of RVs are associated. The total 
number of families, the offspring in each family, and the 
total number of RVs are fixed at 300, 2, and 50, 



respectively. To generate genotypic data for each family, 
a proportion of the RVs are randomly selected to be 
causal, represented by an indicator vector r. Half the 
RVs are randomly selected to be neutral but are asso- 
ciated with the phenotype through population stratifica- 
tion, represented by an indicator vector s. The genotypic 
scores of the parents are independently sampled from 
Bern{2 ■ MAF{k)) for the A:* RV, where MAF{k) is 
fixed as 0.005 throughout all RVs. Then the genotypes 
of children are obtained from parental haplotypes by 
random transmission, denoted by a 2 x 50 matrix G. G 
is divided into the expected genotypic score matrix E 
and the deviation matrix D for the offspring in a family. 
The phenotypes of the 2 sibs in each family are gener- 
ated from NiPi ♦ (£) x r) + ji^'iE x s), E), where jii is 
the effect size, ^2= 0.5 reflects the effect of population 

stratification, and E = ^ ^ 2 ^ covariance matrix to 

reflect the family structure. We set the hyperparameter 
cr^ as 10* and tuned the BF cutoff equal to 2 so that the 
type I error rate is controlled below 0.005. Figure 1 
shows the power curves with ^\ = 1 and 1.5. 

To evaluate FBCM using real data, the association 
analyses are performed by fitting our method to the 
data that use the full pedigree structure provided, with 
the entry for each variant being the estimated number 
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of minor alleles carried. Our aim is to identify the genes 
and underlying RVs related to systolic blood pressure 
(SBP) throughout those 11 chromosomes among the 
GAW18 type 2 diabetes families. To better reflect the 
association between predisposition to hypertension and 
the variants, the highest SBP measured at the 4 exami- 
nation points is selected as the phenotype for each indi- 
vidual. Log-transformation of the phenotype is 
performed to fix the skewness of the phenotype distri- 
bution. The age corresponding to the highest measured 
SBP is included in our model as a covariate to account 
for the significant correlation between age and SBP. 
Individuals with any missing data or without parental 
information are excluded, leaving 275 trios remaining. 

Using the gene information being obtained from 
Ensembl (http://www.ensembl.org/index.html), we inves- 
tigated the genes on all 11 chromosomes. For each 
gene, only variants with MAP of less than 1% within the 
boundary of the gene are included in the analyses. The 
MAFs are estimated using 959 individuals in the dosage 
genotype data. The results are summarized using a 
Manhattan plot in Figure 2, which shows the BFs (see 
formula (2)) for the 16,759 genes across these 11 chro- 
mosomes. Table 1 presents the most significantly asso- 
ciated genes, with their BFs, estimated effect sizes, 
HUGO Gene Nomenclature Committee (HGNC) sym- 
bols (http://www.genenames.org/), and Ensembl gene 



IDs. The effect size conveys the estimated magnitude of 
the relationship between SBP and the transmission 
deviation. Our results show that most genes have BFs of 
far less than 1, and given only 275 trios in the analyses 
such results are not surprising. However, we still identify 
several genes with a BF larger than 2, which is the cut- 
off obtained from the simulation. The evidence of the 
association is substantial for those BFs between 3 and 
10, based on Jeffery's grade of evidence [14], which is 
relatively subjective because BFs can be sensitive to 
many factors such as priors and number of RVs. More 
precise threshold values can be determined by permuta- 
tion within or between chains in MCMC method. 
Although the potential influence of RVs on SBP is elu- 
sive, previous studies have identified a handful of genes 
with common variants (MAF>1%) associated with SBP 
[15]. Our results indicate that several genes with under- 
lying RVs are potentially related to SBP and deserve to 
be further scrutinized. 

Next, we investigated the underlying SNPs among the 
most related gene, HNRNPA1P13. The most significant 
SNPs based on their BFs in formula (3) are listed at the 
bottom of Table 1. The MAF information on these 
SNPs comes from the 1000 Genomes Project (http:// 
www.1000genomes.org/). The larger BFs favor the evi- 
dence against the null hypothesis and indicate that positive 
deviation from the expected number of transmitted minor 
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Figure 2 Manhattan plot of BFs in the GAW18 study for the SBP. The horizontal axis shows the chromosomal start positions of the genes. 
The vertical axis shows the BFs of the genes. 
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alleles drives the effect of the gene, while the BFs much 
lower than 1 suggest the effect of the deviation in the 
opposite direction. For example, given the effect size of 
gene HNRNPA1P13 is 0.43, SNP at position 135765214 
with BP 0.00368 indicates that more transmitted copies of 
a minor allele from parents are likely to have a negative 
impact on the phenotype. 

Discussion 

The FBCM proposed here is a novel statistical method 
for analyzing the association of RVs in pedigree data. 
The new methodology accounts for potential nonasso- 
ciated variants by introducing random effects in a multi- 
plicative way to approximate and capture the variant 
effects. The FBCM also takes into account situations in 
which some pooled variants can be associated with a 
phenotype through population stratification. Although 
the between-family component in our model can be 
integrated into the family-level random effect (pi, when 
the RV under investigation shows a significant between- 
family effect, our model performs better by capturing 
this effect to reduce the residual error. The model is 
based on the HQTDT, but expands the HQTDT by 
incorporating the collapsing information of the deviation 
from the expected genotypic score for a group of SNPs 
and at the same time maintaining orthogonality. Unlike 
the model selection method [9], our model employs ran- 
dom effects to predict variant-specific effects based on 
data and succeeds in boosting the sensitivity for gene 
association detection by collapsing the random effects of 
RVs. By taking advantage of the rare occurrence of 
minor alleles in an individual, the algorithm considers at 
most 2 sites in order to reduce the number of predic- 
tors, circumventing the huge computational burden 
involved in obtaining the full posterior distribution. 

In variant-level analysis, our method improves the 
power to detect RV effects. The improvement of statisti- 
cal power can be achieved by accounting for the random 
effects of all variants in a candidate gene through Gibbs 
sampling. Moreover, in the GAW18 data, it has been 
shown that the occurrence of a RV tends to be more 
common if a family member carries a minor allele. 
Thus, the family-based analysis is expected to have 
more power than the independent population-based 
analysis. The results show that our family-based method 
is able to identify both genes and individual SNPs signif- 
icantly related to the phenotype, even in RV situations. 

Our model can be further expanded in many ways. 
The appropriate link functions can be employed to han- 
dle other forms of phenotype, such as binary data. In 
this study, we focus on families without any missing 
data. However, for the trios with missing parental geno- 
type information, the genetic score can be decomposed 
into between-family and within-family components by 



using only sibs genotypes. For the random effect distri- 
bution, the Bernoulli distribution is assigned as the prior 
distribution for the random effects of individual variants. 
For better modeling of the effects of individual variants, 
more sophisticated distributions can be employed. 

Conclusions 

We have demonstrated that a novel FBCM can be 
applied to identify associations between RVs and quanti- 
tative traits for pedigree data. This method cannot only 
detect the gene effect, but can also pinpoint the underly- 
ing SNPs. Compared to other methods for handling 
RVs, our method based on family data improves statisti- 
cal power by collapsing and accounting for all possible 
RV effects in a gene with population stratification con- 
trolled. Because the method allows for computational 
efficiency in obtaining the full posterior distribution, it 
is applicable to large-scale association tests. The results 
of our genome-wide analyses provide insights into the 
potential role of RVs in SBP. 
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